Annotation Enrichment Analysis: An Alternative Method for Evaluating the 

Functional Properties of Gene Sets 



Kimberly Glass ^'■^''^^*, and Michelle Girvan^'^ 
^Department of Biostatistics and Computational Biology, 
Dana-Farber Cancer Institute, Boston, MA, USA 
^Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA 
^Department of Physics, University of Maryland, College Park, MD, USA 
'^Institute for Physical Science and Technology, University of Maryland, College Park, MD, USA 

Functional annotation databases provide a wealth of information for evaluating and interpreting 
biological systems. These databases generally exhibit non-uniform annotation properties in which 
many genes are annotated to a few non-specific biological functions, but only a few genes are as- 
sociated with each of the numerous highly-specific biological functions. We investigate how these 
annotation properties influence the predictions made when using standard overlap statistics to de- 
termine the functional enrichment of gene sets. We find such approaches are strongly biased toward 
over-estimating the overlap significance between a gene signature and a functional category if ei- 
ther is associated with an unusually high number of annotations. More specifically, we determine 
the statistical significance of the overlap between randomly generated sets of genes and genes anno- 
tated to Gene Ontology categories using Fisher's Exact Test (FET), a statistical measure commonly 
employed in functional enrichment analysis, and show that the significance level of the association 
between a gene set and a functional category is positively correlated with the both the number of an- 
notations made by genes in that gene set and the number of annotations to that functional category. 
Furthermore, we point out that many published gene signatures include a large number of highly 
annotated genes. To correct for this annotation number bias, we develop Annotation Enrichment 
Analysis (AEA) and use it to evaluate the functional significance of published gene signatures. We 
show that, especially when fully considering the structural properties of Gene Ontology annotations, 
AEA is able to predict biologically meaningful results, most of which are contained in FET, but are 
obscured by the many false-positive enrichment scores that occur in FET due to annotation number 
bias. We suggest that AEA should be used either alongside or in lieu of traditionally-used statistics 
when evaluating the functional enrichment of GO categories in gene sets so as to more accurately 
capture the biological functions represented by those gene sets. 



INTRODUCTION 



1.1. Motivation 



Evaluating the functional properties of gene sets has 
been used both to verify that the genes implicated in a 
biological experiment are functionally relevant to the sys- 
tem in question [35] and to discover unexpected shared 
functions between those genes [311 [41]. This type of anal- 
ysis has become a routine step in understanding high- 
throughput biological data [551 [S3]. It is therefore im- 
portant that functional enrichment analysis return re- 
sults that are both statistically sound and biologically 
meaningful. 

One of the most widely used databases for functional 
annotations is the Gene Ontology (GO) [21 [12]. This 
database is highly regarded both for its comprehensive- 
ness and its unified approach for annotating genes in dif- 
ferent species to the same basic set of underlying func- 
tions [2]. In order to evaluate the level of connection 
between a gene signature predicted by an experimental 
system and the set of genes that are annotated to a given 
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biological function, most functional enrichment analysis 
tools rely on set-overlap statistics [35] . Because these ap- 
proaches are subject to an increase in type I errors asso- 
ciated with multiple hypothesis testing, corrections such 
as the Benjamini, Bonferroni, and FDR are often also 
applied [30] (discussed in more detail in Section 1.2.2). 



Young et al. pointed out that these standard statistical 
approaches are sometimes inadequate, specifically when 
evaluating the functional properties of gene-sets derived 
from RNA-seq data since these experiments are prone 
to selection-bias due to variability in gene length [64] . 
Despite the wide use of functional analysis tools, how- 
ever, little attention has been paid to whether or not the 
underlying properties of the functional databases them- 
selves may contribute to spurious statistical results. For 
example, it is known that the number of annotations to 
functional terms in these databases has a heavy-tailed 
distribution [25] . To address this issue we investigate 
whether this heavy-tailed distribution leads to a bias in 
traditional functional analysis methods. We find that 
the significance level of the association between random 
gene sets and functions in the Gene Ontology appears to 
be positively correlated with the number of annotations 
made to the genes (degree of the genes) in a given gene 
set and the number of genes annotated to a particular 
GO category (degree of the GO term) . 

We investigate the properties of experimentally- 



2 



derived gene signatures, as reported in the Gene Signa- 
tures Database \l5\- We focus on the annotation prop- 
erties of the members of each gene signature, where 
these properties are defined based on the Gene Ontology. 
We find that most signatures include a disproportionate 
number of highly annotated genes. This is likely in part 
due to a non-independence between identified signatures 
and functional annotations, since genes that are involved 
in phenomena such as cancer are highly studied and thus 
more likely to be frequently annotated in databases such 
as GO. Given that these oncogenes are also likely to be 
identified in the experimental systems employed by re- 
searchers seeking to better improve our understanding of 
cancer, a method must be developed to account for the 
statistical bias surrounding gene sets with a larger than 
expected number of annotations. 

Consequently, we propose a method, called Annota- 
tion Enrichment Analysis (AEA), that focuses on the 
overlap in annotations between a set of genes and the 
GO terms belonging to a particular branch of the GO 
hierarchy. By looking at annotation overlap instead of 
gene overlap our approach takes into account the anno- 
tation properties of the Gene Ontology. We then extend 
this concept and develop a structurally-preserving ran- 
domization scheme (SP-AEA) that evaluates the signifi- 
cance of annotation overlap by randomly sampling from 
the annotations contained in the Gene Ontology in or- 
der to even better capture the complex annotation re- 
lationships between genes and functional categories that 
may easily be missed by traditional set-overlap statis- 
tics. Implementations of both approaches are provided 
at [http:/ /www-networks. umd.edu| 



1.2. Background 

1.2.1. Properties of Functional Annotations 

There are many functional annotation databases that 
have been developed in order to help classify genes ac- 
cording their various roles in the cell [HI [Ml ESI 1101 El] ■ 
These databases highlight different aspects of cellular 
function. Here, we focus our analysis on functional anno- 
tations made to the Gene Ontology because of its wide 
use by many functional enrichment tools (for example 
[U O EH EH [53]). Since many of the annotation proper- 
ties of the Gene Ontology are shared by other databases 
[5S], we believe that the methods we develop here could 
be applied to functional enrichment analysis using other 
classification databases. 

The Gene Ontology [5] takes the form of a directed 
acyclic graph (DAG) in which "child" functional cate- 
gories ( "terms" ) can be subclassified under one or more 
other, more general categories, called "parent" terms, us- 
ing "is a" and "part of" relationships. "Branches" in the 
Gene Ontology can therefore be defined as sets of terms 
that contain a parent term and all of its progeny. Note 
that these branches will contain overlapping sets of terms 



since each term can be a descendant of multiple ancestors 
at each level of the DAG. Within this structure, genes 
are annotated to a set of functional categories related to 
that gene's particular role in the cell. These annotations 
are transitive such that a parent term will take on all the 
genes annotations associated with any of its progeny [55j . 
Consequently, terms with many progeny often contain 
many gene annotations whereas terms with few progeny 
generally have fewer associated genes. "Biological Pro- 
cess," "Molecular Function," and "Cellular Component" 
are the three most general terms in GO, defining three in- 
dependent branches such that every other term can only 
belong to one of these three categories. As a consequence 
all genes in GO are annotated to least one, and often all 
three, of these categories. Note that since every parent 
terms take on the annotations of its progeny, the number 
of unique genes annotated to a parent term is the same 
as the number of unique genes annotated to the branch 
defined by the parent term and its progeny; however, the 
total number of annotations made to any parent term 
with children is less than the total number of annota- 
tions made to the corresponding branch (defined by the 
term and its progeny), since individual genes often have 
multiple annotations to terms within a branch. 

Since we want to determine the influence of annotation 
properties on functional enrichment analysis, especially 
in the context of experimental gene signatures in com- 
monly studied diseases such as cancer, we focus our study 
on annotations in the Gene Ontology associated with hu- 
man genes. With this in mind we downloaded informa- 
tion regarding gene-term annotations for human genes 
from the Gene Ontology website (geneontology.org) and 
used this data to construct a gene-term bipartite graph, 
represented as an nc x ut adjacency matrix, where uq is 
the total number of genes and nx is the total number of 
terms listed in the annotation file. In this matrix a value 
of one indicates a known connection between the corre- 
sponding gene and term, and a value of zero indicates 
that the gene is not associated with that term. We will 
denote the uq x n-r adjacency matrix of this bipartite 
graph by B. Thus 



I ^ gene i is annotated to term p 
I if gene i is not annotated to term p' ^ ' 



Many terms are only associated with a small handful of 
genes, while some terms are associated with many genes. 

A histogram of the "degree" of terms (fcj^' — ^^B/p, 

I 

or the number of genes annotated to term p) reveals a 
heavy-tailed relationship (Figure 1(a)). In contrast, a 



histogram of the "degree" of genes (fcg*^ — ^^Bu, or the 

/ 

number of terms to which gene i is annotated) shows that 
although some genes have many more annotations than 
others, a large portion of genes have approximately the 
same number of annotations (Figure JT(b)). 
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Degree of Term (1^) 
(a)Term Degree Distribution 




Degree of Gene (k^) 
(b)Gene Degree Distribution 

FIG. 1: The cumulative degree distributions of (a) genes and 
(b) terms in human GO annotations. 



The "Biological Process" ontology contains a signifi- 
cant fraction of the total annotations. Although all three 
ontologies are used in functional enrichment analysis, it 
is common to focus on this ontology, both for its size 
and because its members describe dynamical processes 
performed by the cell. We will do the same in the follow- 
ing analysis. The total number of annotations made to 
the "Biological Process" ontology is 656783, originating 
from 18930 genes to 10192 terms. Consequently, the av- 
erage number of annotations made by an individual gene 
is 43.2 and the average number of annotations made to 
an individual term is 64.4. These values will be useful to 
keep in mind, especially as we investigate the annotation 
properties of gene signatures and of the terms for which 
they are enriched. 



1.2.2. Evaluating Functional Enrichment m Gene Sets 
using Set-Overlap Statistics 

The most widely used statistics for evaluating which 
functional categories are enriched in a set of genes are 
based on gene counts and include Fisher's Exact Test, the 
binomial test, and the chi-squared test Although 
these statistics vary in exact implementation, they all 
rely on the same basic underlying assumption that all 
genes have an equal probability of being selected under 



the null hypothesis. Of these tests. Fisher's Exact Test 
(FET) is the most common statistic and is used by many 
of the most popular functional enrichment tools (see Ta- 
ble 2 in [30])) and therefore we choose it to represent a 
"typical" evaluation of gene set functional enrichment. 
Although it is recognized that this statistic makes as- 
sumptions in its null hypothesis that fail to reflect the 
complex properties of the Gene Ontology, it is widely 
regarded as a good guide in determining what types of 
functions are represented in a given set of genes. 

FET is related to the hypergeometric probability and 
can be used to calculate the significance, or p-value esti- 
mated using FET {pp{Ngt)) of an overlap between two 
independent sets. For example, given a gene set con- 
taining Ng genes, a GO term with fc( annotations, and 
Ntot total genes annotated in GO, the probability that 
Ngt or more genes belong both to this gene set and are 
annotated to the GO term can be calculated as: 

PF{Ngt) = P{N>Ngt\Ng,kuNtot) 

min[Ng,kt] ^kt^ ^Ntot-kt^ ^^'^ 
~ $Z (Ntot) ■ 

Since most functional enrichment analysis compares a 
gene set to all the terms in GO, multiple- hypothesis test- 
ing corrections are often applied to these p- values [30] . 
These corrections raise the value at which a comparison 
between a gene set and a GO term should be considered 
significant. Commonly used multiple- hypothesis correc- 
tions include the Bonferroni, Benjamini and the False 
Discovery Rate. Of these, the Bonferroni is the most 
conservative. It adjusts the value at which a test is con- 
sidered "significant" by the number of tests made: 

/3 = a/n (3) 

where a is the value at which an individual test was pre- 
viously considered significant, and j3 is the value at which 
an individual test should be considered significant, given 
n repetitions of that test. This is equivalent to multiply- 
ing the p-value obtained from a test such as FET by n 
and using the same critical value to determine whether 
the result is statistically significant. 

The majority of multiple-hypothesis corrections will 
only change the critical value of individual tests, but will 
not affect the rank ordering of these tests. One popu- 
lar exception to this is the False Discovery Rate (FDR). 
The FDR adjusts the value at which a test is considered 
"significant" based on the rank of the predicted level of 
significance: 

P ^ an/r (4) 

where, as before a is the value at which an individual 
test was previously considered significant, /3 is the value 
at which an individual test should be considered signifi- 
cant, given n repetitions of that test, and r is the rank 
of the test. This will provide approximately the same 
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(a) Fisher's Exact Test 
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(b)Annotation Enrichment Analysis 



FIG. 2: The significance (measured by p-value) of GO terms 
in 200 randomly generated gene sets. Tfie terms are ordered 
based on flow many genes are annotated to the term (kt) and 
the gene sets are ordered based on totaf the number of an- 
notations (Mg) made by the 200 genes in that set. There is 
an obvious bias toward significant enrichment between high 
degree gene-set/term pairs in (a) Fisher's Exact Test (FET), 
but this is correctiy accounted for using (b) Annotation En- 
richment Analysis (AEA). 



correction as the Bonferroni for the most significantly- 
ranked p-values but will not adjust tests that are the 
least-significant by rank. As a consequence, the rank or- 
dering of the significance can change slightly when using 
FDR. 



created by first randomly selecting Ng genes. We then 
randomly selected one gene in this gene set (gene i) and 
one gene not in the gene set (gene j). If replacing gene i 
with gene j caused the total number of annotations made 
by genes in the gene set to approach the Alg , we replaced 
gene i with gene j with a high probability {p ~ 0.95), but 
if the replacement caused the average degree of the gene 
set to move farther away from Mg we replaced gene i with 
gene j with a low probability {p = 0.05). This swapping 
continued until the total number of annotations made by 
the gene set was within 0.1% of Mg. 



In this way we created 200 gene sets with Ng 



200 



genes each, but whose average degree {kavg = Mg/Ng) 
varies from approximately 21 to 65, or from around half 
to 1.5 times the expected average degree of 43 (see Sec- 
Using FET, we evaluated the enrichment 



1.2.1) 



tion 

of all GO terms in the "Biological Process" ontology in 
each of these random gene sets. Figure 2(a) shows a heat 
map of the significance of the overlap between each of 
these gene sets and GO terms that have 200 or more 
gene annotations, ordered based on their total number of 
gene annotations. The trend is striking. Gene sets with 
a higher number of annotations tend to have an abun- 
dance of enriched GO terms compared to gene sets with 
a lower number of annotations. Furthermore, GO terms 
with many gene annotations tend to be the most signif- 
icantly enriched, especially in these "high-degree" gene 
sets. This indicates not only that false information is 
predicted for gene sets with many annotations, but that 
information may be lost for gene sets or GO categories 
with a relatively lower level of annotation. 

We point out that although multiple-hypothesis correc- 
tions will sufficiently lower the value at which a p-value 
is considered significant such that either very few or no 
false positives will occur, the biases themselves cannot 
be overcome in this manner. Statistical corrections such 
as Bonferroni modify the threshold at which a result is 
considered significant (see Equation [s]) by taking into ac- 
count the number of different comparisons being made 
in a given statistical analysis, but do not consider the 
properties of the gene sets or terms themselves, thus us- 
ing this correction the ordering of the p-values will not 
change and the bias will remain. Even an FDR correc- 
tion (see Equation |4]) , which can alter the rank ordering 
of significance, is insufficient to overcome this strong sig- 
nal (see Supplemental Figure 1). 



2. METHODS 

2.1. The Effect of Annotation Bias on Standard 
Functional Enrichment Analysis 

To determine the effect of annotation properties on 
GO enrichment analysis we created random gene sets 
in which we controlled the total number of annotations 
made by the genes belonging to each gene set. Each 
set with a desired total number of annotations, Afg, was 



2.2. Correcting for Annotation Bias using 
Annotation Enrichment Analysis 

Clearly annotation properties of both genes and func- 
tional categories can influence the results of functional 
enrichment analysis. In order to mitigate this effect, we 
suggest that instead of considering the overlap between 
two gene sets, as is traditionally done in functional en- 
richment analysis, one instead considers the overlap be- 
tween annotations made to a gene set and a branch of 
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A'^ - number of genes in signature 
^, - number of genes annotated to branch 
A'^, - number of genes common to signature and branch 
M -number of annotations to signature 
M, - number of annotations to branch 
M -number of annotations between signature and branch 



# Gene in signature 

Gene not in signature 
H Term in branch 

1 I Term not in branch 



EXAMPLE 

FET: =4; *, =3; A'„=3; Pp(iVj,) = 0.071 

AEA: M^, =12; M, =4; M^„ =4; Pj(M^,) =0.102 



FIG. 3: An illustration of how AEA calculates the signifi- 
cance between a given gene signature and branch of the GO 
hierarchy (see Equation[5|, and how that same information is 
calculated using FET (see Equation [2|. 



terms in the gene ontology. We call our method Annota- 
tion Enrichment Analysis (AEA). 

Considering an entire branch of terms simultaneously 
is crucial to correctly estimating the annotation overlap 
between a gene set and a biological concept. As an il- 
lustration, consider the "Biological Process" category in 
GO, which is the ancestor of all other terms in the on- 
tology and thus contains annotations from every gene. If 
we only considered the "Biological Process" term alone, 
then the number of annotations shared with this term 
and a gene set will be equal to the number of genes in 
the gene set, but it is more appropriate, given the na- 
ture of the term, for the number of common annotations 
to be equal to all the annotations made to the gene set. 
Using annotations to the full branch (a term and all its 
progeny) will correctly account for this. 

Given that we want to estimate the significance of an- 
notation overlap, one logical approach is to simply count 
the number of annotations made to a gene set, the num- 
ber of annotations made to a branch in GO, and the 
the number of annotations extending between that gene 
set and branch, and then once again use the hypergeo- 
metric probability to determine the significance of this 
overlap. This approach assumes that annotations are in- 
dependent. Unfortunately, this is not the case for GO as 
a gene may only be annotated to a term a single time, 
but independent annotations would imply that the gene 
could be annotated to a term multiple times. 

Acknowledging that we are making some false assump- 
tions regarding the structure of gene-term annotations, 
we suggest that one can still estimate an enrichment for 
annotations between a gene set and a GO branch in the 
following manner. Given Mg annotations to a gene set, 
Mt annotations to terms belonging to a GO branch, and 
Mfot annotations made in the GO ontology, the prob- 
ability of finding Mgt or more annotations in common 



between these two sets can be written as: 

PA{Mgt) = P{M>Mgt\Mg,MuMtot) 

min[Mg,Mt] ( Mt\ ( Mtat- Mt\ 
\— ^ V i / V M„-i I 



(5) 



i=Mgt 



This equation can be used to calculate the significance 
(or p- value using AEA, pj^(AIgt)) of enrichment between 
genes in a signature and genes annotated to a branch in 
the GO hierarchy, taking into account annotation infor- 
mation. A visual representation of how the significance 
of overlap would be calculated using FET compared to 
AEA is shown in Figure [3j 

We determined the significance of all GO branches in 
our randomly generated gene sets with AEA, and cre- 
ated a heat map of these values as we had done for the 
significance values produced using standard set-overlap 
statistics (Figure 2(b) ). The results of AEA are uniform 
across both varying gene set and term degree, demon- 
strating that AEA works well at eliminating annotation 
bias, although the predicted p- values are sometimes mis- 
leadingly low due to the independence assumption (for 
further discussion see Section 3.4 1. 



3.1. 



3. RESULTS 

Annotation Properties of Experimental Gene 
Signatures 



One of the most common applications of GO enrich- 
ment analysis is to ascertain the functional properties 
of an experimentally determined set of genes. Although 
we have demonstrated that AEA corrects for annotation 
bias with randomly generated gene sets, we also want to 
know how well this analysis can recapitulate biologically- 
relevant results. With this in mind we downloaded sig- 
natures as recorded in Gene Signatures Database (Gen- 
eSigDB) [H]. This database is a manual curation of 
previously published gene expression signatures, focus- 
ing primarily on cancer and stem cell signatures |14j . In 
the following analysis we will use all 309 human signa- 
tures from this database that contain at least 100 and 
less than 1000 genes that also are annotated to a term in 
the "Biological Process" ontology. 

First, to assess whether annotation bias might play a 
role in evaluating the functional properties of these gene 
signatures, we determined the average number of annota- 
tions made to the genes occurring in each signature. Fig- 
ure 4(a) shows the number of genes in a signature plotted 



against the average level of annotation for each signature. 
The expectation for a random selection of genes (the av- 
erage number of annotations made to all genes in GO - 
see Section 1.2.1 ) is shown as a red line. The plot suggests 



a strong preference for these signature genes to contain 
many more annotations than expected by chance. Al- 
most a third (99) of the signatures have an average level 
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# of Terms Considered Important 
by FET but not by AEA 

(b)FET vs. AEA on Signatures 



FIG. 4: Annotation properties of experimental gene signa- 
tures, (a) The number of genes versus the average number of 
annotations made to the genes in each signature. Genes from 
signatures generally contain many more GO annotations than 
one would expect if selecting genes randomly (shown in red), 
(b) The number of terms that are considered important (top 
5% by rank) by one of the measures (either AEA or FET), but 
not important (bottom 85% by rank) by the other, plotted for 
each gene signature. The signatures are colored according to 
the average level of annotation {kavg = Mg/Ng). 



of annotation that is greater than any of our randomly 
generated gene sets (see Section 2.1 1 and all but four sig- 



natures have an average level of annotation greater than 
expected by chance. Since we have shown that random 
gene signatures with these annotation levels encounter 
a bias in traditional functional enrichment analysis, we 
believe these experimental signatures arc an appropriate 
biological set with which to evaluate how AEA compares 
to FET when investigating and discovering the functions 
of genes contained in experimental biological data. 



3.2. AEA vs FET on Expression Signatures 

We predicted the enrichment of all "Biological Pro- 
cess" GO terms in these signatures both by traditional 



set-overlap statistics (FET) as well as with AEA. We 
firstly investigated how the level of annotation made to 
a signature influences the results of the two metrics com- 
pared to one another. Although the exact functional cat- 
egories considered most enriched according to the two 
measures may be different, we wanted to test if the two 
measures gave the same general results. In other words, 
are the categories ranked highly by FET also ranked 
highly by AEA and are the categories ranked poorly by 
FET also ranked poorly by AEA. To this end we selected 
the top 5% of terms (510 terms) based on their enrich- 
ment score in FET and AEA to designate as "important" 
according each to these measures. We compared this list 
of terms to the list of terms that are "not important" 
(in the bottom 85% of terms by rank) according to each 
measure. The number of terms considered important in 
AEA but not by FET versus the number of terms consid- 
ered important by FET but not AEA for each signature 
is plotted in Figure 4(b)[ Signatures are colored based on 
the average level of annotation to their member genes. 

In signatures containing the highest level of annota- 
tion, the terms deemed most "important" by FET are 
more likely to be considered "unimportant" according 
to AEA (note that "perfect" correlation between FET 
and AEA would result in a single point at (0,0)). Al- 
though the bias is not quite as strong, for signatures 
with lower levels of annotation, the terms deemed more 
"important" by AEA are sometimes considered "unim- 
portant" by FET. These results are consistent with the 
previous analysis in random gene sets that showed a bias 
by FET to place more significance between gene sets and 
terms with a higher number of annotations (see Figure|2| . 



3.3. Specific Predictions made only by FET and 
not by AEA 

To get of sense of whether the terms that are strongly 
enriched in signatures according to FET but not AEA 
constitute biologically meaningful information missed by 
AEA or uninformative predictions by FET we selected 
the ten term-signature pairs most significantly associated 
by Fisher's exact test that are shown to be equivalent 
to chance {p > 0.5) by AEA. We also investigated the 
ten term-signature pairs most significantly associated by 
AEA that are given a p- value equivalent to chance (p > 
0.5) by FET. These pairs are shown in Table 1 along with 
some of the annotation properties associated with those 
gene signatures and GO terms. 

The selected pairs immediately suggest an FET bias for 
enrichment of signatures that include genes with a high 
number of annotations (as indicated by the large values in 
the kavg column, expected value is approximately 60) as 
well as highly annotated GO terms (as indicated by the kt 
column, expected value is 64.4). Furthermore, the terms 
that appear in the top ten most enriched pairs by FET 
and not AEA (e.g. "cellular process" and "metabolic 
process") are quite general and reveal little information 
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Signature 


GO category 


ka^g 


kt 


FET 


(FDR) 


AEA (SP-AEA) 


Breast (Supp. Table l,l52j) 


cellular metabolic process 


78.76 


6259 


2.24e-85 


(1.16e-80) 


0.99 (0.68) 


Breast (Supp. Table l,|52l) 


cellular macromolecule metabolic process 


78.76 


4382 


1.84C-71 


(5.09e-67) 


1.00 (0.86) 


Breast (Supp. Table l.[52l1 


primary metabolic process 


78.76 


6541 


3.66e-69 


(9.00e-65) 


1.00 (0.88) 


Breast (Supp. Table l.[52l) 


metabolic process 


78.76 


7234 


8.47e-63 


(1.72e-58) 


1.00 (0.95) 


ProteinKinasos (TableS2, IsHl 


cellular metabolic process 


86.00 


6259 


9.58e-55 


(1.42e-50) 


1.00 (1.00) 


Breast (Supp. Table l.[52l) 


macromolecule metabolic process 


78.76 


5088 


5.46C-52 


(6.99e-48) 


1.00 (0.84) 


ProteinKinases (TableS2, ISH') 


primary metabolic process 


86.00 


6541 


7.86e-48 


(8.14e-44) 


1.00 (1.00) 


StemCell (TableSS, l26l) 


cellular process 


82.65 


11520 


8.41e-46 


(7.98e-42) 


1.00 (0.95) 


StemCell (TableS3, [26]) 


nucleo-base, -side, -tide and nucleic acid metabolic process 


82.65 


2884 


2.56C-43 


(2.07e-39) 


0.74 (0.60) 


StemCell (TableSS, EB]) 


cellular nitrogen compound metabolic process 


82.65 


3301 


2.42C-42 


(1.82e-38) 


1.00 (0.75) 


Ovarian (Supp. Table 1, [3]) 


kidney development 


55.00 


128 


0.63 


(1.00) 


8.11C-30 (0.014) 


Breast (Supp. Table 1, gg]) 


kidney development 


50.72 


128 


0.56 


(1.00) 


l.Olc-14 (0.101) 


Lypmhoma (TabloS4, ITTl') 


kidney development 


51.64 


128 


0.66 


(1.00) 


1.42e-13 (0.0505) 


Breast (Supp. Table 1, gg]) 


renal system development 


50.72 


134 


0.61 


(1.00) 


2.54C-13 (0.108) 


Lypmhoma (TabloS4, TP) 


renal system development 


51.64 


134 


0.68 


(1.00) 


l.Ole-12 (0.0554) 


Stomach (TableS2b, i66i') 


cellular process 


38.20 


11520 


0.66 


(1.00) 


1.45e-10 (0.0161) 


Lypmhoma (TableS4, [TT]) 


urogenital system development 


51.64 


169 


0.76 


(1.00) 


9.15e-10 (0.0664) 


Lung (TableS2, ITTl') 


regulation of cellular amine metabolic process 


39.53 


68 


0.51 


(1.00) 


2.53e-08 (0.0414) 


Breast (Supp. Table 6, gS]) 


regulation of T cell activation 


49.39 


211 


0.57 


(1.00) 


3.15e-08 (0.0606) 


Breast (Supp. Table 1, [191) 


heart development 


68.48 


297 


0.52 


(1.00) 


7.12e-08 (0.0845) 



TABLE I: Term-signature pairs considered most significant by FET that are given a p- value equivalent to chance (p > 0.5) by 
AEA as well as pairs considered most significant by AEA that are given a p-value equivalent to chance by FET. Signatures 
are identified based on their cell-type, publication, and table reference, as reported in GeneSigDB. For the pairs enriched in 
FET and not AEA, the average degree of the genes in these signatures (kavg) and the degree of the terms they are enriched for 
[kt) both tend to be of higher degree (even compared to the already high degree of signatures considered in this analysis) and 
it is unclear how the functions uncovered by FET and not AEA are important for the specific biological systems represented 
by these signatures. In contrast, in the pairs selected as enriched by AEA and not FET, the average degree of the genes in 
the signatures {kavg) and the degree of the terms they are enriched for {kt) are much closer to expectation than those that are 
significant by FET, however, the assuming independence between annotations led to several mis-leadingly low p-values where 
a single highly annotated gene is able to drastically effect the estimated significance. 



about the specific biology performed by the genes in the 
signature. Based on this list it appears that the results 
from FET that are discounted by AEA are generally un- 
informative. Although many of the genes in these signa- 
tures indeed perform the listed biological functions, it is 
likely that the enrichment reported by FET is a conse- 
quence of the fact that the genes in the signature perform 
many functions. It is useful to point out that these "spu- 
rious results" have, by no means, questionable p-values. 
Even with multiple-hypothesis corrections (FDR noted 
in parentheses) these values remain small enough to con- 
clude that these gene signatures are involved in these 
particular biological functions, when in reality, this "sig- 
nificant" level of association may merely be due to the 
annotation properties of the signatures and terms them- 
selves. 

In contrast, the term-signature pairs enriched in AEA 
and not FET are no longer biased toward high-degree 
signatures and are not as strongly biased toward high- 
degree terms. The enriched terms selected by AEA and 
not FET are also much more "specific" compared to the 
ones selected by FET and not AEA and may at first ap- 
pear intriguing given cancer's identified connection with 
stem cells (see for example, [43]), and hence developmen- 
tal processes. Further investigation, however, shows that 
the strong enrichment values are largely a consequence 



of the invalid assumption made in Equation [5] that an- 
notations are independent of genes. This is manifested 
in the fact that many of these term-signature pairs share 
many annotations in common, but only one or two genes. 
For example, the genes contained in the top signature 
enriched in AEA and not FET (Ovarian (Supp. Table 
Ij 0)) collectively include a total of 6380 annotations, 
of which 61 extend to either "kidney development" or 
one of its progeny, a branch of GO containing a total 
of 966 annotations. This is a very significant overlap 
(8. lie — 30), given the 656783 total annotations made to 
terms in the "Biological Process" ontology; however, all 
61 annotations originate from the same highly-annotated 
gene, OSRl (263 total annotations in the "Biological Pro- 
cess" ontology). Because we had assumed that annota- 
tions are independent of one another, the information 
that an annotation to OSRl normally is connected to 
many other annotations, was lost when calculating the 
p-value. In other words. Equation [5] effectively ignores 
information regarding the degree of individual genes and 
terms, thus allowing a single high-degree gene in a signa- 
ture to incorrectly influence the resulting p-values. 
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3.4. A Structure Preserving Form of Annotation 
Enrichment Analysis 

Although AEA both successfully corrects for annota- 
tion bias and is conceptually appealing, it assumes that 
annotations made to genes and terms are completely in- 
dependent of one another, ignoring much of the struc- 
ture contained in the Gene Ontology. In this section 
we develop a randomization scheme that preserves the 
structure of GO annotations while calculating the signif- 
icance the number of co-annotations between a gene set 
and a GO branch. This scheme preserves the annotation 
properties of individual genes and terms, thus incorpo- 
rating that structure into the background distribution 
and allowing us to even more accurately assess the sig- 
nificance of overlap between a given gene signature and 
GO branch. We call this approach structure-preserving 
AEA (SP-AEA) and illustrate it in Figure [s) 

In this randomization is it again useful to think of the 
Gene Ontology as a bipartite graph (see Section 1.2.11. 



structure-Preserving 
Annotation Enrichment 
Analysis 



As with AEA, for SP-AEA we begin by determining A/g, 
the number of annotations to a gene set, Mt, the number 
of annotations to the terms in a GO branch, and M^t, the 
number of annotations stretching between this gene set 
and branch. We then determine the expected number of 
co-annotations between this branch and a random sets of 
genes or between this set of genes and a random selection 
of terms. In order to do this, we perform two randomiza- 
tions, one of the order of genes and the other of the order 
of terms, while still preserving the original connections 
from the GO bipartite graph. We then take annotations 
connected to the top random genes until we've selected 
Mg annotations, and determine Mg, the number of edges 
in the bipartite graph that extend between the top ran- 
dom genes and the branch of the GO hierarchy. Simi- 
larly, we take annotations connected to the top random 
terms until we've selected Mt annotations, and determine 
Mt, the number of edges in the bipartite graph that ex- 
tend between the top random terms and the original gene 
set. In the (fairly common) case where selecting the top 
Mg/Mt annotations does not correspond to selecting a 
whole number of genes/terms, we take a random selec- 
tion of edges from the final gene/term selected in order 
to get exactly Mg/Mt annotations. We repeat the ran- 
domization process many times in order to determine a 
distribution of values for Mg given the gene annotations, 
and a distribution of values for Mt given term annota- 
tions. In order to summarize the result, we define a new 
p-value, ps{Mgt) which reflects the signiflcance of the 
Mgt annotations by averaging the the probability that 
Mg > Mgt and the probabihty that Mt > Mgt: 



Ps{Mgt) = ]^P{Mg > Mgt) + ^P{Mt > Mgt). (6) 

We tested the performance of this approximation (us- 
ing 10, 000 randomizations) by determining the func- 
tional enrichment of GO terms in our randomly gener- 




( 1 ) Determine M ^, (2) Randomize 



(1) Determine number of annotations between signature and branch. 

(2) Randomize order of genes or terms, preserving original connections. 

(3) Determine number of annotations between top random genes and 
branch or between top random terms and signature. 

(4) Repeat steps (2)-(3) to build distributions of values. 

(5) Determine pjf/W^ J by averaging the probabilities estimated based 
on these two distributions. 



/■(A/, >M./ 



9 Gene in signature 
O Gene not in signature 
^1 Term in branch 
[ I Term not in branch 



TV - number of genes in signature 

- number of annotations to signature 
A', - number of terms in branch 

- number of annotations to branch 

M ^1 - number of annotations between signature and branch 

M - number of annotation s between top random genes and branch 

M, - number of annotations between top random terms and signature 



EXAMPLE: 

N, =4; M, =12 
«, = 3; M, = 4; M„ = 4 
=3; M, =4 
p,(M„) = l/2 



FIG. 5: An outline of how Structure-Preserving Annotation 
Enrichment Analysis (SP-AEA) calculates the significance of 
association between a given gene signature and the collection 
of terms that belong to a branch in the GO hierarchy. 



ated gene sets (see Section 2.1|. As with AEA, SP-AEA 
effectively eliminates annotation bias (see Supplemental 
Figure 2). Further, the predicted p- values are much more 
reasonable compared to those made by either FET or 
AEA as only 9 gene-set/term pairs have a p-value less 
than 10"" according to SP-AEA compared to 433 in FET 
and over two million in AEA (compare Supplemental Fig- 
ure 2 with Figure 2). This indicates that SP-AEA both 
correctly accounts for annotation bias between various 
gene sets and terms, and also recognizes that these are 
indeed random selections of genes that should have no 
coordinated functional properties. 



Finally, we ran SP-AEA on our experimental gene sig- 
natures (one million randomizations). In order to verify 
that preserving the structure of GO annotations in SP- 
AEA adequately negates the influence of single genes on 
p- values, we determined the top ten term-signature pairs 
enriched by FET and not by SP-AEA and vice versus. 
The pairs enriched in FET and not SP-AEA are the ex- 
act same pairs that were enriched in FET and not AEA 
(see Table 1), however, no signature-term pairs are en- 
riched in SP-AEA at a significance less than 0.01 (and 
only three with a significance less that 0.05) that have a 
p-value greater than 0.5 by FET. Similarly, no pairs are 
predicted as significant by SP-AEA that are not also sig- 
nificant by AEA. P-values as predicted by SP-AEA are 
shown for the term-signature pairs in Table 1. 
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(a)Fisher's Exact Test (b)Annotation Enrichment Analysis (c)Structure-Preserving Annotation 

Enrichment Analysis 



FIG. 6; Clustergrams representing enriched term-signature pairs, (a) A clustering of signatures and terms selected based 
on their enrichment-score according to FET. These signatures (from [1 El [lOl [13| El EH [Ml l37H40l l46l l47l ISTl l52l[56l [57l 
1591 1621 1631 167) ') did not cluster as well into distinct biologically units, (b) A clustering of signatures and terms selected 
based on their enrichment-score according to SP-AEA. These signatures look like they may fall into three clusters generally 
associated with regulation and stimulus response (from [71 1341 1^51 1381 147l 1521 1631 157] ). cycle cycle and metabolic processes (from 
[I0l|l3[l6l|24l[33, 37, 39, 51, 56-58, 61, 62 ) and developmental processes (from [2T1|2B1|40]). This clustering is something not 
apparent using the FET measure but even more apparent using SP-AEA. (c) A clustering of signatures and terms selected based 
on their enrichment-score according to SP-AEA. The signatures and terms break into several, biologically distinct units. One 
includes cellular-differentiation signatures published in [211 1231 l40l I59j . The second is associated with immune-response, and 
includes signatures published in [T] EH E71 |5H1 jSH] HH HTl ESI ■ Another cluster includes breast cancer signatures published 
in [131 [161 [33l|5Tl [56l [58l [6T1|62] . Finally two lymphoma [24l[57] and a viral signature [8] associated with proliferation are also 
included. 



3.5. AEA Reveals Biologically-Relevant Functions 

Finally, we investigated the kind of biology that is 
highlighted using AEA and SP-AEA compared to FET. 
In order to focus on the most relevant biology we se- 
lected 30 representative terms/signatures for each mea- 
sure. We selected these terms/signatures by ranking ac- 
cording to the minimum enrichment each has across all 
signatures/terms and selecting top 30 by this rank. For 
SP-AEA, since many (586) term-signature pairs have an 
estimated p- value of zero even after one million random- 
izations, we broke ties by the number of signatures/terms 
enriched in the terms/signatures at this level. We then 
performed hierarchical clustering (using the "cluster- 
gram" function in Matlab using default settings) on the 
terms and signatures selected for each of the three mea- 
sures. The results are shown in Figure [6j 

Clustering the FET results does not give rise to any ap- 
parent term-signature structure, however, several of the 
individual pairs highlight important biological processes. 
For example, the FET clustering shows an enrichment 
of cell-cycle related processes in breast cancer signatures 
|32| and includes immune-related terms enriched in im- 
mune gene signatures. This, however, accounts for only 
a handful of the selected terms; the clustergram also in- 
cludes a number of functional categories related to "pro- 
teins" and "phosphorylation" that are only enriched in 



a small number of signatures. Although this analysis 
should not be considered evidence that FET produces 
incorrect or unmeaningful results, we do suggest that the 
results presented in the clustergram and as well as in the 
preceding sections indicate that the results of FET can 
be, and often are, muddled by a signal driven by annota- 
tion bias, highlighting either highly-annotated signatures 
or more general biological processes that may not be spe- 
cific to the systems in question. 

This becomes evident in the clustergrams of the terms 
and signatures selected using either AEA or SP-AEA 
where much of the biology, although often present in 
FET, is more obvious because of the removal of the an- 
notation bias. The clearest results are obtained using the 
structure-preserving SP-AEA where several distinct clus- 
ters of signatures and terms emerge that are correlated 
with categories such as cellular-differentiation markers, 
immune-response, and breast cancer. 

The terms associated with each cluster represent bio- 
logical processes that are known to be involved in these 
types of signatures. For example the cluster enriched 
for terms such as "system development" and "develop- 
mental process" includes signatures identified based on 
their role in cellular differentiation. A signature associ- 
ated with homeodomains is included in this cluster. This 
is consistent with the biological role of homeodomains, 
since many are know to initiate cascades of genes, which 
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in turn induces cellular differentiation into tissues and 
organs. The Leukemia signatures [21] were both ob- 
tained by looking at the expression levels from cells where 
CEBPck, a protein known to induced differentiation in 
Leukemia cells [65j, is silenced. The cluster also includes 
a set of oncogenes [23j, as well as two breast cancer sig- 
natures, one of which includes genes involved in angio- 
genesis [55] . 

In contrast, the cluster that primarily includes signa- 
tures from immune-systems, lymphoma and leucocytes, 
is logically also enriched in terms such as "immune sys- 
tem" and "response to stimulus" as well as terms related 
to "biological regulation" . Interestingly, the breast signa- 
ture strongly associated with this cluster [47] represents a 
list of genes defined based on immune response in breast 
cancer and the stem cell signature [7^ is from a study on 
patients with systemic sclerosis, a type of autoimmune 
disorder. In addition, the protein-kinase signature |22j 
included in the cluster is from a study of p58 Mitogen- 
activated protein (MAP) kinase signaling of mRNA sta- 
bility. MAP kinases have been shown to play an impor- 
tant role in immune response [20| . 

Another cluster associated only with breast cancer sig- 
natures shows a strong enrichment for terms related to 
the cell cycle and cellular component organization, pro- 
cesses known to be differentially regulated in breast can- 
cer |32| . Finally, two lymphoma and one viral signa- 
ture that were identified based on cell proliferation (for 
example, by association with Myc targeting [8ll2l]) are 
enriched for terms such as "cellular metabolic process." 
This is consistent with expectation since there is evi- 
dence that a connection exists between proliferation and 
metabolic pathways in cancer cells [HI |60] . 



CONCLUSION 



signatures that, although found by FET, were obscured. 
The limitations of FET in this context is largely at- 
tributable to the fact that many of these published gene- 
signatures include a large number of highly-annotated 
genes that, as a consequence, influences the results of 
functional enrichment analysis. Of course, the associ- 
ation between highly annotated genes and their more 
abundant occurrence in published gene sets may be at 
least partially attributable to the fact that some of these 
same publications may have been used in assigning a sub- 
set of the annotations in GO. We point out that although 
it is possible that newly-derived gene signatures may not 
exhibit the same level of annotation-bias as previously- 
published signatures, it is also very probable that highly 
annotated genes are important in a wide variety of well- 
studied systems and will continue to show up and in- 
fluence the results of functional enrichment analysis on 
newly generated gene sets. In light of this we suggest us- 
ing our approach alongside or in place of other traditional 
measures, especially for gene signatures that are known 
to contain significantly more or less annotations than one 
would expect by chance. We believe that AEA will allow 
biologists to better interpret the functional roles of genes 
identified as important in their experimental system. 
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